Semantic and Generative Models for Lossy Text Compression

Authors

  • Ian H. Witten
  • Timothy C. Bell
  • Alistair Moffat
  • Craig G. Nevill-Manning
  • Tony C. Smith
  • Harold W. Thimbleby
Abstract

The complementary paradigms of text compression and image compression suggest that there may be potential for applying methods developed for one domain to the other. In image coding, lossy techniques yield compression factors that are vastly superior to those of the best lossless schemes, and we show that this is also the case for text. This paper investigates the resulting tradeoff between subjective quality of the transmission and its compression factor. Two different methods are described, which can be combined into an extremely effective technique that provides far better compression than the present state of the art and yet preserves a reasonable degree of perceived match between the original and received text. The major challenge for lossy text compression is the quantitative evaluation of the quality of this match.

Introduction

We have been struck by the apparent divergence between the research paradigms of text and image compression [1, 2], despite the fact that both are concerned with compressing information whose subjective quality must be recoverable. Schemes for text compression are invariably reversible or lossless, whereas although there certainly exist lossless methods of image compression, much research effort addresses irreversible or lossy techniques such as transform coding, vector quantization, and fractal approximation. The divergence between the text and image paradigms is unfortunate because the opportunity for symbiosis between the two approaches is lost, and advances in one domain have negligible impact on the other. Although there are superficial reasons why one might choose to neglect the topic of lossy text compression, such as the difficulty of evaluating the quality of the regenerated message, in this paper we suggest that a great deal can be gained by taking seriously the idea of approximate compression of text.

CONTRIBUTIONS AND STRUCTURE OF THE PAPER

We have developed two novel techniques for lossy compression of text, and they are described in sufficient detail for our work to be replicated and tested in realistic compression situations. The necessarily very short texts that are used in the examples exhibited in this paper, along with their statistical limitations, can be no more than indicative of the underlying power of the techniques we describe.

†Address for correspondence. Phone (+44) 786 467679; fax 786 467641.

Similar resources

A Joint Semantic Vector Representation Model for Text Clustering and Classification

Text clustering and classification are two main tasks of text mining. Feature selection plays the key role in the quality of the clustering and classification results. Although word-based features such as term frequency-inverse document frequency (TF-IDF) vectors have been widely used in different applications, their shortcoming in capturing semantic concepts of text motivated researchers to use...

AmbientGAN: Generative Models from Lossy Measurements

Generative models provide a way to model structure in complex distributions and have been shown to be useful for many tasks of practical interest. However, current techniques for training generative models require access to fully-observed samples. In many settings, it is expensive or even impossible to obtain fully-observed samples, but economical to obtain partial, noisy observations. We consid...

Model-Based Semantic Compression for Network-Data Tables

While a variety of lossy compression schemes have been developed for certain forms of digital data (e.g., images, audio, video), the area of lossy compression techniques for arbitrary data tables has been left relatively unexplored. Nevertheless, such techniques are clearly motivated by the ever-increasing data collection rates of modern enterprises and the need for effective, guaranteed-quality...

Improvement of generative adversarial networks for automatic text-to-image generation

This research is related to the use of deep learning tools and image processing technology in the automatic generation of images from text. Previous research has used a single sentence to produce images. In this research, a memory-based hierarchical model is presented that uses three different descriptions, presented in the form of sentences, to produce and improve the image. The proposed ...

CS 2980: Model-based Semantic Compression in Database Project Report

We are now in the big data era: advances in data collection and management technologies have led to large databases. For example, domains such as medicine, biology, music and the experimental sciences in general are all characterized by large data sequences. Considering the amount of space big data requires and the amount of I/O required to fetch it, data compression has become an efficie...

Journal title:
  • Comput. J.

Volume 37

Publication date: 1994